Hashing, Graphs, Trees, Search Trees, Indexing and Multiway Trees, File Organization

Introduction

The Table abstract data type

Implementations of the table data structure

Hash Tables

Bucket Array

Hash Function

Hash Code

Compression Functions

Collision-Handling Schemes

Collision likelihoods and load factors for hash tables

Hash Table Implementation

A simple Hash Table in operation

Strategies for dealing with collisions

Linear Probing

Double Hashing

Complexity of hash tables

Understanding the Birthday Paradox:


Imagine you have a group of people, and you want to store their birthdays in an array using a hash function. The hash function assigns a number to each person based on their birthday, where January 1st is 1, January 2nd is 2, and so on, up to December 31st as 365.

Now, if you want to store, let's say, 24 people in an array with 365 locations (one for each day), you might assume that collisions, where two people share the same birthday, would be unlikely. However, this assumption is mistaken, and it leads us to the "von Mises birthday paradox."

The paradox is that even with as few as 24 people, the probability that two of them share a birthday already exceeds 50%. This counterintuitive result is known as the birthday paradox.


Probability of Collision:


To understand this, let's look at the probability, denoted as p(n), that among n people, at least two share the same birthday. To compute this, it's easier to first calculate the probability q(n) that no two people share a birthday and then use the relationship p(n) = 1 - q(n).

For instance:

- q(2) = 364/365, because the second person can have a birthday on any day except the first person's.
- q(3) = (365 * 364 * 363) / 365^3, since each of the first three people must avoid all earlier birthdays.


As we extend this to the general case (n > 1), the formula becomes q(n) = 365! / (365^n * (365-n)!), where "!" denotes a factorial. The surprising result is that p(23) is slightly more than 50%, meaning that with 23 people, it's more likely than not that at least two share the same birthday.
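The probabilities q(n) and p(n) defined above can be computed directly by multiplying out the factors, avoiding the huge factorials; a minimal Python sketch (function names are illustrative):

```python
def q(n, days=365):
    """Probability that n people all have distinct birthdays."""
    prob = 1.0
    for i in range(n):
        prob *= (days - i) / days  # person i+1 must avoid the i earlier birthdays
    return prob

def p(n, days=365):
    """Probability that at least two of n people share a birthday."""
    return 1.0 - q(n, days)
```

Evaluating these shows p(22) is just under 0.5 while p(23) is about 0.507, confirming that 23 people are enough to make a shared birthday more likely than not.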


Implications for Hash Tables:


Now, let's apply this insight to hash tables. If 23 random locations in a table of size 365 have more than a 50% chance of overlapping, it suggests that collisions will occur in any hash table unless it uses an impractical amount of memory.

This observation brings us to the concept of the "load factor" of a hash table. The load factor (λ) is defined as the ratio of the number of entries (n) to the size of the table (N). So, λ = n/N. A load factor of 0.25 means the table is 25% full, 0.50 means 50% full, and so on.


The Load Factor and Collision Probability:


The probability of a collision for the next key to be inserted is equal to the load factor (λ): if λ is 0.50, there is a 50% chance of a collision for the next key. This assumes that all keys are equally likely and that the hash function distributes them evenly over the array.

To minimize collisions, it is wise to keep the load factor low. A commonly quoted maximum is 50%; beyond about 80%, the performance of the hash table deteriorates considerably.
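The claim that the next insertion collides with probability λ can be checked by simulation; a small sketch under the uniform-hashing assumption (names are illustrative):

```python
import random

def next_collision_rate(table_size, n_entries, trials=20_000):
    """Estimate the chance that the next uniformly hashed key lands
    in one of n_entries already-occupied slots of a table_size table."""
    hits = 0
    for _ in range(trials):
        # Fill n_entries distinct random slots (uniform hashing assumption).
        occupied = set(random.sample(range(table_size), n_entries))
        # Hash one more key uniformly and see whether it collides.
        if random.randrange(table_size) in occupied:
            hits += 1
    return hits / trials
```

With table_size = 365 and n_entries = 182 (λ ≈ 0.5), the estimate hovers around 0.5, matching the analysis above.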


Conclusion


In conclusion, the birthday paradox illustrates that collisions are more likely in hash tables than one might intuitively think. Understanding the load factor helps quantify how full a hash table is, allowing us to manage collision probabilities effectively. Keeping the load factor low is crucial for optimizing the performance of hash tables. The insights from the von Mises birthday paradox provide a fascinating and practical perspective on collision probabilities in hashing scenarios.

Load Factor


The load factor of a hash table is the ratio of the number of entries (n) to the size of the table (N). It indicates how full the hash table is and is calculated as λ = n/N. Keeping the load factor below certain thresholds ensures efficient operation of the hash table. For open-addressing schemes, maintaining λ < 0.5 is recommended, while for separate chaining, λ < 0.9 is advisable. When the load factor approaches 1, hash table operations can become inefficient due to increased clustering of items, leading to longer probing times and potentially linear expected running times for operations.
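These thresholds suggest a simple resize check; a hypothetical helper, where the 0.5 and 0.9 limits come from the recommendations above:

```python
def needs_resize(n, N, open_addressing=True):
    """Return True when the load factor n/N reaches the recommended
    limit: 0.5 for open addressing, 0.9 for separate chaining."""
    limit = 0.5 if open_addressing else 0.9
    return n / N >= limit
```

A real implementation would typically double the table size (and rehash every entry) whenever this check fires.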


Hash Table


Hash tables, which use a technique known as hashing, offer fast insertion, deletion, and lookup in constant average time. Unlike binary search trees, they excel at unordered operations but are inefficient at ordered tasks such as finding the minimum or maximum value or printing the entire table in sorted order.
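Python's built-in dict is itself a hash table, so the constant-average-time operations described here can be demonstrated directly:

```python
table = {}

table["alice"] = 42        # insertion: average O(1)
value = table["alice"]     # find: average O(1)
found = "bob" in table     # membership test: average O(1)
del table["alice"]         # deletion: average O(1)

# What a hash table does NOT give you cheaply: min, max, or sorted
# traversal all require visiting (and sorting) every key.
```

By contrast, a binary search tree would hand back the keys in sorted order for free, at the cost of O(log n) per operation.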


Collision


A hash function maps each key to an address; a collision occurs when two or more keys map to the same address, so that two records compete for the same spot. Designing a perfect hash function that avoids this for arbitrary keys is impractical, making collisions effectively inevitable.
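A collision is easy to produce with the usual hash-then-compress scheme (hash code modulo table size); a minimal sketch:

```python
def bucket(key, table_size=10):
    """Compress a key's hash code into a bucket index."""
    return hash(key) % table_size

# Distinct keys can still land in the same bucket: in CPython,
# hash(7) == 7 and hash(17) == 17, so both keys map to bucket 7
# of a 10-slot table -- a collision the table must resolve.
```

Resolving such collisions is exactly what the linear-probing, double-hashing, and chaining strategies listed earlier are for.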